Abstract

The use of socially generated "big data" to access information about collective states of mind in human
societies has become a new paradigm in the emerging field of computational social science. A natural
application of this would be the prediction of society's reaction to a new product in the sense of
popularity and adoption rate. However, bridging the gap between "real-time monitoring" and "early predicting"
remains a big challenge. Here, we report on an endeavor to build a minimalistic predictive model for the
financial success of movies based on collective activity data of online users. We show that the popularity
of a movie can be predicted well in advance by measuring and analyzing the activity level of editors
and viewers of the corresponding entry to the movie in Wikipedia, the well-known online encyclopedia.



Introduction 

Living in the digital world of today has, along with all its advantages, side effects and byproducts.
Our daily life nowadays leaves a digital trace of all our activities in the recently developed Information and
Communications Technology based environments. Our social communications through different digital
channels, financial activities within e-commerce, physical locations registered by cell phone providers,
and many other activities of this kind are traced and recorded in a rather passive way. On top of that, we also
actively share information about our feelings, emotional moods, opinions and views through the so-called
Web 2.0, or user-generated content within social media. In addition to providing novel answers
to classic questions about individual and social aspects of human life from a scientific point of view, precise
analysis of this huge amount of data could have practical applications to predict, monitor, and cope
with many different types of events, from simple matters of daily life to massive crises on the global
scale. For example, Sakaki et al. have developed an alerting system based on Tweets (posts in the
Twitter microblogging service) that is able to detect earthquakes almost in real time [1]. They elaborated
their detection system further to detect rainbows in the sky and traffic jams in cities [2]. The
practical point of their work is that the alerting system could perform so promptly that the alert message
could arrive at certain regions faster than the earthquake waves. Bollen et al. have analyzed moods of
Tweets and, based on their investigations, could predict daily up and down changes in Dow Jones
Industrial Average values with an accuracy of 87.6% [3]. Saavedra et al. investigated the relationship
between the content of traders' messages and market dynamics. They show that there is a positive
correlation between the usage of "bundles" of positive and negative words and the agents' overall financial
performance [4]. Another example is using Twitter to predict electoral outcomes [5], albeit with its
biases and limitations [6, 7]. Interesting studies have appeared on the use of social media indicators
to predict the scientific impact of research articles, e.g., short-term web usage (number of downloads from
the pre-print sharing web site "arXiv") [8] and Twitter mentions [9]. In a recent work, it is shown that
Twitter mentions and arXiv downloads follow two distinct temporal patterns of activity; however, the
volume of Twitter mentions is statistically correlated with arXiv downloads and early citations [10]. Preis
et al. found a correlation between weekly transaction volumes of "S&P 500 companies" and weekly Google
search volumes of corresponding company names [11]. By analyzing search queries for information about
preceding and following years, a "striking" correlation between a country's GDP and the predisposition
of its inhabitants to look forward is observed. Based on Google search logs, Ginsberg et al. estimated
the spread of influenza in the United States.

Statistical analysis of motion picture markets has led to intriguing results, such as evidence
for a Pareto law for movie income [14], along with a log-normal distribution of the gross income per theater
and a bimodal distribution of the number of theaters in which a movie is shown [15]. Despite much effort
with different approaches, predicting the financial success of a movie remains a challenging open problem.
For example, Sharda and Delen have trained a neural network to process pre-release data, such as quality
and popularity variables, and classify movies into nine categories according to their anticipated income,
from "flop" to "blockbuster". For test samples, the neural network classifies only 36.9% of the movies
correctly, while 75.2% of the movies are at most one category away from correct [16]. Joshi et al. have
built a multivariate linear regression model that joined metadata with text features from pre-release
critiques to predict the revenue with a coefficient of determination R^2 = 0.671 [17]. Since predictions
based on classic quality factors fail to reach a level of accuracy high enough for practical application,
the use of user-generated data to predict the success of a movie becomes a very tempting approach. Ishii
et al. present a mathematical framework for the spread of popularity in society [18]. Their model, which
takes the advertisement budget as an input parameter and generates a dynamic popularity variable, is
validated against the number of Weblog posts on the particular movies in the Japanese blogosphere.
In other words, they consider the activity level of the bloggers as a representative parameter for social
popularity. However, Mishne and Glance, by analysing the sentiment of Weblog stories on movies,
emphasize that the correlation between pre-release sentiment and sales is not at an adequate level
to build a predictive model on [19]. In a very interesting approach, Asur and Huberman set
up a prediction system for the revenue of movies based on the volume of Twitter mentions [20]. They
achieve an adjusted coefficient of determination of 0.97 on the night before the movie release for the first
weekend revenue of a sample of 24 movies. In a later work, however, Wong et al. show that Tweets do
not necessarily represent the financial success of movies [21]. They consider a sample of 34 movies and
compare the Tweets about the movies to the evaluations of users on movie review web sites. They argue that
predictions based on social media could have high precision but low recall.

Wikipedia, as a predominant example of user-generated media, has been extensively studied from different
points of view. Its size and growth [22-24], topical coverage and notability of entries [25-27], conflict
and editorial wars among users [28-32], editorial patterns [33] and linguistic features [34] are only a few
examples of research topics associated with Wikipedia. We are aware of two comprehensive reviews [35, 36]
and a brief hands-on guide to some of the most recent Wikipedia research [37].



Although the effects of external events on the activity of Wikipedia editors [38, 39] and the number of
page views [40, 41] have been studied in detail, the use of Wikipedia as a source of information to detect
and predict events in the real world has been limited to the work by Osborne et al., in which they
used Wikipedia page views to fine-filter the outcome of their algorithm for Twitter-based "first story
detection".

In this work we consider both the activity level of editors and the page views to assess the popularity 
of a movie. We define different predictor variables and apply a linear regression model to forecast the 
first weekend box office revenue of a set of 312 movies that were released in the United States in 2010. 
Our analysis not only goes beyond previous works in the number of movies under investigation,
but also advances the state of the art by providing reasonable predictions as early as one month prior
to the release date of the movie. Finally, our statistical approach, free of any language-based analysis,
e.g., sentiment analysis, could be easily generalized to non-English speaking movie markets or even other
kinds of products.

Results

According to data from Box Office Mojo, there were 535 movies that were screened in the United States
in 2010 (see the Methods section). We could track the corresponding page in Wikipedia for 312 of them.
A closer look at the history of these 312 articles shows that many of them were created well before
the release date of the movie (Fig. 1A). This enables us to follow the popularity of the movie well
in advance. To estimate the popularity, we followed four activity measures: V, the number of views of the
article page; U, the number of users, being the number of human editors that contributed to the article; E,
the number of edits made by human editors on the article; and R, the collaborative rigor (or simply rigor [43])
of the editing train of the article. To have a consistent time framework, we set the release time of the
movie as t = 0. For more details see the Methods section. Examples of the daily increments of the number of
views and the number of users are shown in Supplementary Fig. S1. The daily increments of both variables
rise and fall around the day of release, similarly to the observations by Ishii et al. [18]. In addition to these,
an essential parameter for predicting the movie revenue is the number of theaters that screen the movie,
T, which is included in our set of parameters. The complete dataset including the financial data as well
as Wikipedia activity records is available via the Supplementary Data S1. To give an overall picture
of the sample, histograms of the accumulated values of the 4 activity parameters from the first edit on the
article up to 7 days after release, along with the first weekend box office revenue and the number of
theaters screening the movie, are depicted in Fig. 1B-G. It is clear that revenues among the sample have
a bimodal distribution (Fig. 1B). This is in accord with [15]. It also shows that Wikipedia coverage
is not limited to financially successful movies. The considerable amount of activity on Wikipedia
articles (Fig. 1D-G) indicates the richness of the data. However, before building a regression model,
the correlations between the activity parameters and box office revenue should be examined.

The Pearson correlation coefficient r_j(t) between the accumulated value x_j(t) of the j-th predictor variable
from the inception of the article up to time t before the movie release and the box office revenue y is calculated
as

\[
r_j(t) = \frac{\langle x_j(t)\, y \rangle - \langle x_j(t) \rangle \langle y \rangle}{\sqrt{\langle x_j(t)^2 \rangle - \langle x_j(t) \rangle^2}\; \sqrt{\langle y^2 \rangle - \langle y \rangle^2}}, \qquad (1)
\]

with \langle \cdot \rangle indicating the average over the whole sample. Temporal correlations are shown in Fig. 2. For all
predictors the correlation coefficient gradually increases as time approaches the day of release, and around
the day of release the correlation rises suddenly. Note that V shows the highest correlation with the revenue
prior to the release of movies.
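
As an illustration of the correlation formula, the following minimal Python sketch evaluates the Pearson coefficient directly from sample averages of the predictor, the revenue, their product and their squares. The function name `pearson_from_moments` and the toy numbers are assumptions made for this example only; they are not taken from the dataset.

```python
import numpy as np

def pearson_from_moments(x, y):
    """Pearson correlation computed from sample averages, as in the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # <x y> - <x><y>
    covariance = np.mean(x * y) - np.mean(x) * np.mean(y)
    # sqrt(<x^2> - <x>^2) and sqrt(<y^2> - <y>^2)
    std_x = np.sqrt(np.mean(x ** 2) - np.mean(x) ** 2)
    std_y = np.sqrt(np.mean(y ** 2) - np.mean(y) ** 2)
    return covariance / (std_x * std_y)

# Hypothetical accumulated page views at some time t and first weekend revenues
views = [1200.0, 5400.0, 300.0, 9800.0, 2500.0]
revenue = [2.1e6, 9.5e6, 0.4e6, 2.2e7, 3.0e6]
r = pearson_from_moments(views, revenue)
print(round(r, 3))
```

Repeating the calculation for the accumulated values at each time t would give one point of the temporal correlation curve per predictor.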

We build a multivariate linear regression model for predicting the box office revenue. The general
form of a regression model at time t before release, based on a set of predictor variables S, is

\[
y = \sum_{j \in S} a_j(t)\, x_j(t) + c(t) + \epsilon(t), \qquad (2)
\]

where a_j(t) are the regression coefficients, c(t) is the constant of the linear model and \epsilon(t) is the
residual of the regression. We feed the model with
different combinations of predictor variables and characterize the goodness of different sets by calculating
the coefficient of determination R^2(t). The coefficient of determination is calculated using 10-fold
cross-validation (see the Methods section). The temporal evolution of R^2(t) is shown for different predictor sets S
in Fig. 3. While a model employing {T} can be seen as a benchmark of the state of the art in real
market predictions, the model fed solely by {V} predicts roughly as well. The combinations {V, T}
and {U, T} score well above this benchmark, indicating the relevance of activity measures for prediction.
Among all sets considered (not shown here), {V, U, R, E, T} yields the highest coefficient of determination,
which reaches 0.77 around a month before the movie release.
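
To make the comparison of predictor sets concrete, the sketch below fits a linear model with a constant term by ordinary least squares for several sets S and reports the coefficient of determination. It is only an illustration under stated assumptions: the data are synthetic, the helper name `fit_and_score` is hypothetical, and the score is computed in-sample rather than by the 10-fold cross-validation used for the reported values.

```python
import numpy as np

def fit_and_score(columns, y):
    """Least-squares fit of y on the given predictor columns plus a constant.

    Returns the fitted parameters (coefficients followed by the constant)
    and the in-sample coefficient of determination.
    """
    design = np.column_stack(list(columns) + [np.ones(len(y))])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ params
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
    return params, r2

# Synthetic stand-ins for accumulated views V and number of theaters T
rng = np.random.default_rng(0)
n = 200
data = {"V": rng.uniform(1e3, 1e5, n), "T": rng.uniform(100, 4000, n)}
y = 50.0 * data["V"] + 3000.0 * data["T"] + rng.normal(0.0, 1e5, n)

scores = {}
for S in (("T",), ("V",), ("V", "T")):
    _, scores[S] = fit_and_score([data[name] for name in S], y)
    print(S, round(scores[S], 3))
```

For in-sample least squares, adding a predictor can never lower the score, which is one reason a held-out estimate is the fairer basis for comparing sets of different size.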

Figure 1. Histograms of different variables for our sample of n = 312 movies from 2010.

A: Time of creation t_c of the corresponding article in Wikipedia, shown in movie time (t = 0 is the
release time), B: Release weekend box office revenue in the U.S., C: number of theaters that screened
the movie on the first weekend, D: accumulated number of views, E: users, F: edits, and G: rigor for
the Wikipedia page up to t = 7 days after release.



Discussion 



The results presented above clearly show how a simple use of user-generated data in a social environment like
Wikipedia could enhance our ability to predict the collective reaction of society to a cultural product.
Here we compare the predictive model based on Wikipedia activity measures with the results of the
Twitter-based model provided in the 2010 study of Asur and Huberman [20].

Asur and Huberman use a sample of 24 movies to train and test their model. Following the same approach, we
train and test our model on the same set of movies. The R^2(t) of our Wikipedia model reaches
0.94 a few days before release, compared with 0.98 for the Twitter model. However, the presented results of the Twitter
model are limited to the night of the release, while the model presented here can make predictions with
a reasonable coefficient of determination (R^2 > 0.925) as early as one month before release (see Fig. 4). One should
also bear in mind that the Wikipedia model does not require any complex content analysis and relies only
on statistical measures of activity level.

Fig. 5 shows the actual revenue of movies in the sample against the predicted revenue at t = -30 days.
It is evident that the prediction is more precise for more successful movies. When less successful movies
are considered, deviations from the diagonal line (perfect prediction) increase. Since most of the movies
predicted by the Twitter model are among the successful ones, the applicability of that model to movies with
medium and low popularity levels remains an open question.

While we tried to keep the model as simple as possible and based on only a few variables, one could
possibly enhance the efficiency of prediction by applying more sophisticated statistical methods, such
as neural networks, to more detailed content-related parameters such as the controversy measure of the
article [29].

Figure 2. Temporal evolution of r_j(t), the Pearson correlation of the box office revenue
with different predictors. Time is measured in movie time. Inset: magnified detail of the main
panel, showing the Pearson correlation around the day of release. Dashed horizontal line shows the
correlation for the number of theaters.



Methods 

In this study we consider a sample of 312 movies that were released in the United States in 2010. The 
complete dataset including the financial data as well as Wikipedia activity records is available via the 
Supplementary Data S1. To obtain this dataset, the list of 2010 movies distributed in the U.S. is first
acquired from Box Office Mojo (http://boxofficemojo.com) along with their accompanying financial
data (535 movies). Financial data consist of the release weekend box office revenue and the number of
theaters screening the movie.

In order to locate the corresponding articles in Wikipedia, we use the category system of Wikipedia. 
Wikipedia articles are classified into one or more categories by users. We match the titles of the movies in
the Mojo database with the titles of Wikipedia pages in the categories "2009 films" and "2010 films". Inclusion
of the category "2009 films" is necessary because of movies that were released in 2010 in the U.S. but
had already entered the international market during 2009. Consequently, those movies are classified in the
category "2009 films" in Wikipedia. To achieve the best possible match of the titles, they were stripped
of punctuation and postfixes. Wikipedia uses the latter to maintain the uniqueness of every title, such
as in the case of "Avatar (2009 film)" and "Avatar (computing)". As a result of the matching process
described above, a sample consisting of the financial data and the corresponding Wikipedia page for 312 
movies was obtained. 
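
The title matching described above can be sketched in a few lines of Python. This is a minimal illustration and not the actual matching script: the function names are hypothetical, lowercasing and whitespace collapsing are additional assumptions, and the titles other than "Avatar (2009 film)" are made up for the example.

```python
import re
import string

def normalize_title(title):
    """Strip a trailing parenthetical postfix and punctuation from a title."""
    # Drop a trailing postfix such as "(2009 film)"
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    # Remove ASCII punctuation
    title = title.translate(str.maketrans("", "", string.punctuation))
    # Assumption: compare case-insensitively with collapsed whitespace
    return " ".join(title.lower().split())

def match_titles(mojo_titles, wiki_titles):
    """Map each Box Office Mojo title to a Wikipedia page title, if one matches.

    wiki_titles is expected to hold only pages from the film categories,
    so unrelated pages sharing a base name are not candidates.
    If several pages normalize to the same key, the last one wins.
    """
    wiki_index = {normalize_title(w): w for w in wiki_titles}
    matches = {}
    for m in mojo_titles:
        key = normalize_title(m)
        if key in wiki_index:
            matches[m] = wiki_index[key]
    return matches

wiki_pages = ["Avatar (2009 film)", "Wall Street: Money Never Sleeps", "Some Other Film"]
mojo = ["Avatar", "Wall Street Money Never Sleeps", "Unlisted Movie"]
print(match_titles(mojo, wiki_pages))
```

Movies without a matching page, like the last entry here, simply drop out of the sample, which is consistent with only part of the Box Office Mojo list being retained.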

Figure 3. Coefficient of determination of the multivariate linear regression model fed by
different sets of input variables. The shorthands V, U, R, E, and T denote the number of views, the
number of users, the rigor, the number of edits, and the number of theaters, respectively. The coefficient
of determination was calculated using 10-fold cross-validation (see the Methods section). The dashed
gray line shows the coefficient of determination for linear regression solely based on the number of
theaters.

Figure 4. Comparison of the results with the Twitter-based prediction in the work of Asur and
Huberman [20]. The same sample of 24 movies is used as both training and test set. The
coefficient of determination obtained with the Twitter-based method is 0.98 on the night of the release
(day 0 in movie time).

Figure 5. First weekend box office revenue in the U.S. against its predicted value at
t = -30 days. Green dots represent the smaller sample of 24 movies common to the Twitter and
Wikipedia studies, and black dots are movies from the 2010 sample of 312 movies. Note that negative
predicted revenues for some of the very unpopular movies could not be shown on the logarithmic scale.

For the sake of convenience we introduce movie time, a common time coordinate for the movies in
the scope of our study. By definition, movie time is measured from the time of release in the U.S. All
temporal variables are measured in movie time. Throughout this study, we consider accumulated values
of parameters from the inception of the article to the prediction time t for each activity measure. The
four activity measures are defined as follows:



Number of users: the number of different human users that contributed to the page.

Number of edits: the number of modifications made by human users on the article.

Collaborative rigor: similar to the number of edits; however, it counts multiple subsequent edits by
the same user as one edit [43]. It avoids counting multiple edits by the same user in a short period, e.g.,
to correct errors in their previous contribution. A schematic illustration of the activity measures is presented
in Fig. 6. These three variables are calculated using the page history databases of Wikimedia Toolserver
(http://toolserver.wikimedia.org), which register information about every modification made to the
pages of Wikipedia. To ensure that the above variables count solely human activity, contributions made
by bots are excluded from the calculations. Bots are automated scripts which facilitate automatic tasks such
as spell checking. Contributions made by bots are registered in the same way as revisions by humans;
however, they can be distinguished from human activity using a special entry in the databases of Wikimedia
Toolserver, called the bot flag.

Figure 6. Illustration of different variables characterizing the activity of Wikipedia editors
on an article. Each tick on the axis represents a modification of the page. Different tick styles refer to
different users. Values for the illustrated example: edits: 9, rigor: 6, users: 3.
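
The three editor-based measures can be illustrated with a short Python sketch. It assumes a chronological list of (user, is_bot) pairs as a stand-in for the page history; this data layout, the function name and the example sequence are assumptions made for illustration. The sketch collapses runs of consecutive edits by the same user without any time threshold, and it removes bot revisions before collapsing, which is one possible reading of the definition.

```python
from itertools import groupby

def activity_measures(revisions):
    """Compute edits, rigor and users from a chronological revision list.

    revisions: list of (user, is_bot) tuples, oldest first.
    """
    # Keep only human contributions (bot-flagged revisions are dropped)
    human = [user for user, is_bot in revisions if not is_bot]
    edits = len(human)
    # Rigor: each run of consecutive edits by the same user counts once
    rigor = sum(1 for _ in groupby(human))
    users = len(set(human))
    return {"edits": edits, "rigor": rigor, "users": users}

# Made-up history with three human users (A, B, C) and one bot revision
history = [
    ("A", False), ("A", False), ("B", False), ("SpellBot", True),
    ("A", False), ("C", False), ("C", False), ("B", False),
    ("B", False), ("A", False),
]
print(activity_measures(history))
```

The made-up history is chosen so that the result matches the counts of the schematic example (9 edits, rigor 6, 3 users); it is not the sequence drawn in the figure.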

Number of views: the number of times the given page is viewed from its inception up to the time
t. These data are extracted from the page view statistics section of the Wikimedia Downloads site
(http://dumps.wikimedia.org/other/pagecounts-raw) through the web-based interface of "Wikipedia
article traffic statistics" (http://stats.grok.se). Wikimedia Downloads counts views only since December
2007, and the view count data for July 2008 are corrupted. Therefore it is impossible to count the exact
total number of views up to the time of prediction for all considered pages. We have counted the page hits
from t = -500 days, i.e., 500 days before release, which according to Fig. 1A is sufficiently early. Another challenge
is created by the renaming of articles, which chops the page hit counts into subsets according to the various
titles the page has had throughout its history. To cope with this problem, we followed the logs of "title
moves" in the article history to trace back and merge all the page hits.

Number of theaters: the count
of movie theaters that screen the movie on the first weekend of its release, extracted from Box Office
Mojo.
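
The accumulation of page views in movie time, including the merging of hit counts recorded under earlier titles, could look roughly like the following Python sketch. The input layout (a dictionary of daily hits per title, indexed by an absolute day number) and the function name are assumptions for this example; retrieving the counts and the title-move logs is not shown.

```python
from collections import defaultdict

def accumulated_views(daily_hits_by_title, release_day, t, start=-500):
    """Total page views with movie time between `start` and `t` (inclusive).

    daily_hits_by_title: {title: {absolute_day: hits}} for every title
        the article has had (hypothetical layout).
    release_day: absolute day index of the U.S. release (movie time 0).
    """
    merged = defaultdict(int)
    for hits in daily_hits_by_title.values():
        for day, count in hits.items():
            # Shift to movie time and merge counts across all titles
            merged[day - release_day] += count
    return sum(c for movie_day, c in merged.items() if start <= movie_day <= t)

# Made-up counts for an article that was renamed once
hits = {
    "Old title": {99: 7, 100: 10, 101: 20},
    "New title": {101: 5, 400: 50, 630: 70},
}
print(accumulated_views(hits, release_day=600, t=-30))  # → 85
print(accumulated_views(hits, release_day=600, t=30))   # → 155
```

In the first call the 7 hits on day 99 fall at movie time -501 and are excluded by the -500 day cutoff, while the hits after release are excluded by t = -30.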

To calculate the coefficient of determination, we carry out 10-fold cross-validation by first randomly
dividing our sample of 2010 movies into 10 subsets. In the next step the model is trained on the
union of 9 subsets and tested on the remaining 10th subset. This is repeated for all 10 choices
of the test subset, and the coefficient of determination for the model is obtained as the average over these
repetitions.
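
The cross-validation procedure can be sketched as follows. This is a minimal illustration on synthetic data, not the original analysis code: the function name, the random seed, and in particular the way the coefficient of determination is evaluated on each held-out subset (relative to the mean of that subset) are assumptions.

```python
import numpy as np

def cross_validated_r2(X, y, k=10, seed=0):
    """Average held-out R^2 of a linear model with constant over k folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Randomly divide the sample into k subsets
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Train on the union of the other k-1 subsets
        design = np.column_stack([X[train], np.ones(len(train))])
        params, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        # Test on the remaining subset
        pred = np.column_stack([X[test], np.ones(len(test))]) @ params
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

# Synthetic sample of the same size as the movie sample (312 rows, 2 predictors)
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (312, 2))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.1, 312)
score = cross_validated_r2(X, y)
print(round(score, 3))
```

Because every movie is used for testing exactly once, the averaged score reflects out-of-sample performance rather than the quality of the fit on the training data.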

Supporting Information

Figure S1:

Examples of temporal evolution of Wikipedia-based predictors.

Figure S1. Temporal evolution of Wikipedia-based predictors for two individual movies:
The Wolfman (2010) and MacGruber. The daily increments of the number of views ΔV and the number of
users ΔU are shown for the articles in English Wikipedia that correspond to the two movies. The
temporal axis shows movie time, i.e., a time frame in which t = 0 corresponds to the release date. The
Wolfman earned a box office revenue of $31,479,235 on the release weekend, while MacGruber gained
only $4,043,495. Accordingly, predictor variables take larger values in the case of The Wolfman.



Data S1:

The dataset under study, including the financial and Wikipedia activity data, is available at
http://wwm.phy.bme.hu/SupplementaryDataSl.zip



